When Should You Adjust Standard Errors for Clustering?
Alberto Abadie (MIT), Susan Athey (Stanford), Guido W. Imbens (Stanford), Jeffrey M. Wooldridge (MSU)

Footnote: The questions addressed in this article partly originated in discussions with Gary Chamberlain. We are grateful for questions raised by Chris Blattman and seminar audiences, and for insightful comments by Colin Cameron, Vicente Guerra, four reviewers, Larry Katz, and Jesse Shapiro. Jaume Vives-i-Bastida provided expert research assistance. This work was supported by the Office of Naval Research under grants N00014-17-1-2131 and N00014-19-1-2468.
March 14, 2024
Abstract
Clustered standard errors, with clusters defined by factors such as geography, are widespread in empirical research in economics and many other disciplines. Formally, clustered standard errors adjust for the correlations induced by sampling the outcome variable from a data-generating process with unobserved cluster-level components. However, the standard econometric framework for clustering leaves important questions unanswered: (i) Why do we adjust standard errors for clustering in some ways but not others, e.g., by state but not by gender, and in observational studies, but not in completely randomized experiments? (ii) Why is conventional clustering an “all-or-nothing” adjustment, while within-cluster correlations can be strong or extremely weak? (iii) In what settings does the choice of whether and how to cluster make a difference? We address these and other questions using a novel framework for clustered inference on average treatment effects. In addition to the common sampling component, the new framework incorporates a design component that accounts for the variability induced on the estimator by the treatment assignment mechanism. We show that, when the number of clusters in the sample is a non-negligible fraction of the number of clusters in the population, conventional cluster standard errors can be severely inflated, and propose new variance estimators that correct for this bias.
1 Introduction
Imagine you estimated the effect of attending college on labor earnings using linear regression on a cross-section of U.S. workers. How should you calculate the standard error? Empirical studies in economics often report heteroskedasticity-robust standard errors (henceforth “robust”) associated with the work by eicker1963, huber1967behavior, and white1980heteroskedasticity. A common alternative is to report cluster-robust standard errors (henceforth “cluster”) associated with the work by liang1986longitudinal and arellano1987practitioners, with clustering often applied within geographic units such as states or counties. moulton1986random, moulton1987diagnostics and Bertrand2004did have shown that clustering adjustments can make a substantial difference, and since the 1980s cluster standard errors have become commonplace in empirical economics.
Later in this section, we estimate a log-linear regression of earnings on an indicator for some college using data from the 2000 U.S. Census. We find that standard errors clustered at the state level are more than 20 times larger than robust standard errors. Which ones should a researcher report? The conventional framework for clustering [see cameron2015practitioner, mackinnon2021cluster, for recent reviews] suggests that if the clustering adjustment matters, in the sense that the cluster standard errors are substantially larger than the robust standard errors, one should use the cluster standard errors. In this article, we develop a new framework for cluster adjustments to standard errors that nests the conventional framework as a limiting case. The new framework suggests novel standard error formulas that can substantially improve over robust and cluster standard errors in settings like the earnings regression described above.
Our proposed clustering framework differs from the standard one in that it includes a design component that accounts for between-cluster variation in treatment assignments. We argue that the new design component is important because between-cluster variation in treatment assignments often motivates the use of clustered standard errors in empirical studies [see, e.g., gentzkow2008preschool, cohen2010free]. In addition, our framework shifts the focus of interest from features of infinite super-populations/data-generating processes to average treatment effects defined for the finite (but potentially large) population at hand. As a result of this shift, it is the sampling process and the treatment assignment mechanism that solely determine the correct level of clustering; the presence of cluster-level unobserved components of the outcome variable becomes irrelevant for the choice of clustering level. Moreover, by focusing on finite populations (which could be entirely or substantially sampled in the data) we obtain standard errors smaller than those aiming to measure uncertainty with respect to features of infinite super-populations. We derive the large sample variances for the least squares and fixed effect estimators under our proposed framework and show that they differ in general from both the robust and the cluster variances. We also propose two estimators for the large sample variances, one analytic and one based on a re-sampling (bootstrap) approach. For the U.S. earnings application, our proposals produce standard errors that are substantially larger than the robust standard errors, but also substantially smaller than the conventional version of cluster standard errors.
We use our framework to highlight three common misconceptions surrounding clustering adjustments. The first misconception is that the need for clustering hinges on the presence of a non-zero correlation between residuals for units belonging to the same cluster. We show that the presence of such correlation does not imply the need to use cluster adjustments, and that the absence of such correlation does not imply that clustering is not required. The second misconception is that there is no harm in using clustering adjustments when they are not required, with the implication that if clustering the standard errors makes a difference, one should do so. To see that both of these claims are in fact incorrect, consider the following simple example. Suppose that, based on a random sample from the population of interest, we use the sample average of a variable to estimate its population mean. Suppose also that the population can be partitioned into clusters such as geographical units. If outcomes are positively correlated within clusters, the cluster variance will be larger than the robust variance. However, standard sampling theory directly implies that if the units are sampled randomly from the population there is no need to cluster. The harm in clustering in this case is that confidence intervals will be unnecessarily conservative, possibly by a wide margin. A third misconception is that researchers have only two choices: either fully adjust for clustering and use the cluster standard errors, or not adjust the standard errors at all and use the robust standard errors. We show that a combination of the robust and the cluster variance estimators can substantially improve accuracy over its two components.
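The sample-mean example can be made concrete with a short simulation (a sketch under assumed parameter values; all function names and constants here are ours, not the paper's). The finite population below has strong within-cluster correlation, yet units are drawn by simple random sampling, so the robust standard error should track the actual sampling variability of the sample mean while the cluster standard error overstates it:

```python
import random
import statistics

random.seed(0)

# A finite population of 100 clusters x 200 units with strong
# within-cluster correlation (cluster effect + idiosyncratic noise).
CLUSTERS, SIZE = 100, 200
population = []  # list of (cluster_id, outcome)
for g in range(CLUSTERS):
    alpha = random.gauss(0, 1)                  # cluster-level component
    for _ in range(SIZE):
        population.append((g, alpha + random.gauss(0, 1)))

def robust_se(sample):
    # i.i.d.-style standard error of the sample mean
    y = [v for _, v in sample]
    return (statistics.pvariance(y) / len(y)) ** 0.5

def cluster_se(sample):
    # cluster-robust standard error of the sample mean:
    # sum within-cluster demeaned outcomes, then square and sum over clusters
    y = [v for _, v in sample]
    ybar = statistics.fmean(y)
    within = {}
    for g, v in sample:
        within[g] = within.get(g, 0.0) + (v - ybar)
    return sum(s * s for s in within.values()) ** 0.5 / len(y)

# Repeatedly draw simple random samples of units (no clustered sampling)
# and estimate the population mean by the sample average.
means, r_ses, c_ses = [], [], []
for _ in range(2000):
    sample = random.sample(population, 500)
    means.append(statistics.fmean(v for _, v in sample))
    r_ses.append(robust_se(sample))
    c_ses.append(cluster_se(sample))

sd_true = statistics.stdev(means)  # actual sampling variability
print(sd_true, statistics.fmean(r_ses), statistics.fmean(c_ses))
```

In runs of this sketch, the robust standard error matches the empirical standard deviation of the estimates, while the cluster standard error is substantially larger, illustrating how unnecessary clustering produces conservative confidence intervals.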
The new clustering framework in this article has the advantage of providing actionable guidance on a question of substantial consequence for empirical practice in econometrics: When should standard errors be clustered, and at what level? In the conventional model-based econometric framework, the researcher takes a stand on the error component structure of a model for the outcome variable. For example, suppose that, following moulton1986random, moulton1987diagnostics, the researcher posits a random effects model, with random effects at the state level. In this setting, a repeated sampling thought experiment entails that, for each sample, different values of the state random effects are drawn from their distributions. This model-based approach implies that if we are estimating a population mean using a sample average one needs to cluster the standard errors at the state level even if the sample is a random sample of individuals and not a clustered sample. A drawback of the model-based econometric framework for clustering is that empirical researchers need to take a stand on the structure of the error components of their models.
A second, closely related, framework for clustering that is often invoked in the econometrics literature is motivated by a sampling mechanism that in a first stage selects clusters at random from an infinite population, followed by a second stage of random sampling of units from the sampled clusters (or keeping all units in a cluster). Although this framework is appropriate for some applications in the analysis of surveys, where it originated [kish1995survey, thompson2012sampling], we argue that it is not appropriate for many of the data sets economists and other social scientists analyze. In many applications in economics, researchers do observe units from all the clusters they are interested in, e.g., all the states in the U.S., and a framework based on randomly sampling a small fraction of a large population of clusters does not apply.
Neither of the two conventional frameworks for clustered inference described above fully incorporates the design aspect of clustering. And it is the lack of a design component that makes them inappropriate for inference on treatment effects. To gain insight on the importance of the assignment mechanism for the standard errors of treatment effects estimators, consider a setting with individuals sampled at random from a population, but where treatment is assigned at the cluster level, with the same treatment value for all the individuals in the same cluster. Assume that the quantity of interest is the population average treatment effect. Clustered assignment to treatment is equivalent to clustered sampling of potential outcomes. Because the parameter of interest depends on averages of potential outcomes, which are sampled in a clustered manner, clustering of the standard errors is required in this setting, even when the individual observations are sampled at random. Our framework for clustered inference in this setting is close in spirit to the sampling framework described in the previous paragraph, but it incorporates a design component.
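This argument can be checked numerically. The sketch below (assumed parameter values and a constant unit-level treatment effect of one; all names are ours) assigns treatment at the cluster level, draws units completely at random, and compares the actual variability of the difference in means with the average robust and cluster-robust standard errors:

```python
import random
import statistics

random.seed(1)

# Fixed finite population: 100 clusters x 200 units. Baseline outcomes
# share a cluster-level component; the treatment effect is a constant 1.
CLUSTERS, SIZE = 100, 200
baseline = []  # (cluster_id, untreated outcome)
for g in range(CLUSTERS):
    alpha = random.gauss(0, 1)
    for _ in range(SIZE):
        baseline.append((g, alpha + random.gauss(0, 1)))

def diff_in_means(sample):
    t = [y for _, w, y in sample if w == 1]
    c = [y for _, w, y in sample if w == 0]
    return statistics.fmean(t) - statistics.fmean(c)

def robust_se(sample):
    t = [y for _, w, y in sample if w == 1]
    c = [y for _, w, y in sample if w == 0]
    return (statistics.pvariance(t) / len(t)
            + statistics.pvariance(c) / len(c)) ** 0.5

def cluster_se(sample):
    # cluster-robust SE: sum each unit's influence within its cluster
    t = [y for _, w, y in sample if w == 1]
    c = [y for _, w, y in sample if w == 0]
    tbar, cbar = statistics.fmean(t), statistics.fmean(c)
    scores = {}
    for g, w, y in sample:
        s = (y - tbar) / len(t) if w == 1 else -(y - cbar) / len(c)
        scores[g] = scores.get(g, 0.0) + s
    return sum(s * s for s in scores.values()) ** 0.5

estimates, r_ses, c_ses = [], [], []
for _ in range(1000):
    treated = set(random.sample(range(CLUSTERS), 50))  # cluster-level assignment
    draw = random.sample(baseline, 2000)               # units sampled at random
    sample = [(g, 1, y + 1.0) if g in treated else (g, 0, y) for g, y in draw]
    estimates.append(diff_in_means(sample))
    r_ses.append(robust_se(sample))
    c_ses.append(cluster_se(sample))

sd_true = statistics.stdev(estimates)
print(sd_true, statistics.fmean(r_ses), statistics.fmean(c_ses))
```

Even though units are sampled completely at random, the robust standard error badly understates the variability of the estimator, while the cluster-robust standard error tracks it: clustered assignment, not clustered sampling, is what creates the need to cluster here.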
By shifting the attention from parameters of a data generating process for the outcomes to the average treatment effect for the population at hand, a researcher applying the proposals in this article does not need to take a stand on the error component structure of a model for the outcome variable to calculate standard errors. Instead, all the relevant variability of the estimator with respect to the average treatment effect is generated by the sampling mechanism, which extracts the sample from the population, and the assignment mechanism, which determines which units are exposed to the treatment. We see this as an intrinsic advantage of the framework proposed in this article in settings where it is difficult to justify a particular error component structure.
In this article we make three contributions. The first one is a novel framework for clustering, building on the one developed by abadie2020sampling for the analysis of regression estimators from a design perspective. We allow for clustering both in the sampling process and in the assignment process. As a result, the framework nests both the traditional case of clustered sampling and the case of clustered treatment assignment in experiments as special cases. It also allows for intermediate cases. In particular, treatment assignment may depend on the cluster, but not perfectly so, and there remains variation in treatments within clusters. This framework clarifies the separate roles of clustering in the sampling process and clustering in the assignment process. It also clarifies what we can learn from the data about the need to adjust standard errors for clustering. In our framework, the data are not informative about the need to adjust for clustering in the sampling process, but they are informative about the need to adjust for clustering in the assignment process.
In our second contribution, we derive central limit theorems and large sample variances for the least squares and the fixed effect estimators of average treatment effects that take into account variation both from sampling and assignment. Comparing these variances to limit versions of the robust and cluster variances shows that the robust standard errors are generally too small, and the cluster standard errors are unnecessarily conservative. These comparisons also highlight how heterogeneity in treatment effects affects inference in the estimation of average treatment effects. Often researchers specify models that implicitly assume constant treatment effects without appreciating the implications for inference. We show, however, that heterogeneity in treatment effects introduces additional variance components that affect the need for clustering adjustments.
In our third contribution, we propose new variance formulas and bootstrap procedures for treatment effects estimators in the presence of clustering. We use the term Causal Cluster Variance (CCV) for the analytic variance formulas. For the case of a least squares estimator of average treatment effects, the intuition for the CCV variance formula is as follows. The error of the least squares estimator is approximately equal to a sum, over all units, of an expression involving products of regression errors and regressor values. The robust variance is approximately equal to a sum, over all units, of the squares of these products. In contrast, the conventional cluster variance estimator is approximately equal to a sum, over all clusters, of squares of within-cluster sums of the same products. Although the sum over all clusters of the expectation of the within-cluster sums of these products is zero, the expectation for each cluster separately is not. For each cluster in the sample, it is possible to estimate the expectation of the sum of the products between regression errors and regressor values. The CCV formula uses these estimates to correct the bias of the conventional cluster variance. The CCV correction does not help much if only a small fraction of clusters are sampled. However, when a large fraction of the clusters are represented in the sample, the CCV correction can lead to substantial improvements. This adjustment relies on estimates of cluster-level treatment effects, and thus requires within-cluster variation in treatment assignment. In addition, we propose a bootstrap version of the variance estimator. In contrast to conventional bootstrap procedures, which are based on resampling individual units or entire clusters of units, our proposed Two-Stage-Cluster-Bootstrap (TSCB) conducts resampling in two stages.
In the first stage, the fraction treated for each cluster is drawn from the empirical distribution of cluster-specific treatment fractions. In the second stage, the researcher samples the treated and control units from each cluster, with the numbers of treated and control units determined in the first stage. The CCV and TSCB variance estimators are designed for applications with a large number of observations and substantial variation in treatment assignment within clusters.
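The two resampling stages can be sketched as follows (a schematic simplification, not the paper's exact implementation: the function name, data layout, and the clamping of infeasible treated counts are our own choices):

```python
import random
import statistics

def tscb_se(data, n_boot=200, seed=0):
    """Two-stage cluster bootstrap, schematic sketch.

    data: list of (cluster_id, w, y) observations.
    Stage 1: for each cluster, draw a treated fraction from the
    empirical distribution of cluster-level treated fractions.
    Stage 2: resample treated and control units within each cluster,
    with counts implied by the drawn fraction.
    """
    rng = random.Random(seed)
    clusters = {}
    for g, w, y in data:
        clusters.setdefault(g, ([], []))[w].append(y)  # ([controls], [treated])
    fracs = [len(t) / (len(c) + len(t)) for c, t in clusters.values()]
    estimates = []
    for _ in range(n_boot):
        treated, control = [], []
        for c, t in clusters.values():
            n = len(c) + len(t)
            n1 = round(rng.choice(fracs) * n)   # stage 1: drawn treated count
            # the procedure needs within-cluster variation in treatment;
            # clamp so we never ask an arm for units it does not have
            if not t:
                n1 = 0
            elif not c:
                n1 = n
            else:
                n1 = min(max(n1, 1), n - 1)
            if t:
                treated += rng.choices(t, k=n1)          # stage 2: resample
            if c:
                control += rng.choices(c, k=n - n1)
        estimates.append(statistics.fmean(treated) - statistics.fmean(control))
    return statistics.stdev(estimates)

# Hypothetical data: 20 clusters with cluster-varying assignment probabilities.
rng = random.Random(42)
data = []
for g in range(20):
    a = rng.choice([0.3, 0.7])
    alpha = rng.gauss(0, 1)
    for _ in range(30):
        w = 1 if rng.random() < a else 0
        data.append((g, w, alpha + rng.gauss(0, 1) + w))

print(tscb_se(data))
```

The key design choice, per the description above, is that the first stage resamples treated *fractions* across clusters, so the bootstrap replicates the variability induced by cluster-specific assignment probabilities rather than holding each cluster's treated share fixed.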
To illustrate the empirical relevance of our results, we analyze a sample from the 2000 U.S. Decennial Census, which includes 2,632,838 individuals. We define 52 clusters according to residency in the 50 states, Puerto Rico, and the District of Columbia. We consider two log-linear regressions of individual earnings on a treatment variable that encodes information on college attendance. In the first specification, the treatment variable is defined at the state level. In the second specification, we measure college attendance at the individual level.
Table 1: Dependent variable: log labor earnings

Panel A. Treatment: state indicator for share of some college greater than 0.55

|                                      | OLS      |
| ------------------------------------ | -------- |
| coefficient                          | 0.1022   |
| standard error:                      |          |
| robust                               | (0.0012) |
| cluster                              | (0.0312) |

Panel B. Treatment: individual indicator for some college

|                                      | OLS      | FE       |
| ------------------------------------ | -------- | -------- |
| coefficient                          | 0.4656   | 0.4570   |
| standard error:                      |          |          |
| robust                               | (0.0012) | (0.0012) |
| cluster                              | (0.0269) | (0.0276) |
| causal cluster variance (CCV)        | (0.0035) | (0.0014) |
| two-stage cluster bootstrap (TSCB)   | (0.0036) | (0.0014) |
In Panel A of Table 1, we report results for a regression where the only explanatory variable is a binary treatment that takes value one if the fraction of individuals with at least some college residing in the state is 0.55 or higher, and value zero otherwise (we chose the 0.55 value to ensure sufficient variation in the treatment over the 52 clusters). Notice that the treatment is constant within states. We report the ordinary least squares (OLS) estimate, as well as robust and cluster standard errors. Since the late 1980s, it has been common practice to report cluster standard errors in settings where the regressors are constant within a cluster. Clustering at the state level makes a substantial difference relative to using robust standard errors, with the cluster standard errors approximately twenty-six times larger than the robust standard errors.
In Panel B of Table 1, the sole regressor is an individual-level indicator for at least some college. In addition to OLS, we report the fixed effects (FE) estimate (with fixed effects for the 50 states, plus Washington DC and Puerto Rico) and robust, cluster, CCV, and TSCB standard errors in parentheses. As in the regression of the first panel, clustering at the state level makes a substantial difference in the standard errors, with the cluster standard errors approximately twenty-three times larger than the robust standard errors, both for the OLS and the FE regressions. In Panel B, our proposed CCV and TSCB standard errors for the OLS estimate are 0.0035 and 0.0036 respectively, in between the robust standard errors (0.0012) and the cluster standard errors (0.0269), and substantially different from both. The same holds for the FE estimator: the cluster standard error, 0.0276, is quite different from the robust standard error, 0.0012, while the CCV and TSCB standard errors are 0.0014, in between robust and cluster but much closer to robust.
2 A Framework for Clustering
In this section, we describe in detail the framework for our analysis. There are multiple components to our set-up that are not explicitly modeled in the usual analysis of the variance of econometric estimators. In general, quantifying the uncertainty of parameter estimates requires describing the population and articulating the assumptions that describe how the sample was generated from that population (that is, building a model for the data generating process). In our framework, there are three distinct sources of sampling variation that lead to variation in the estimates. First, there is variation across samples in which units are observed in each cluster. Second, there is potentially variation in which clusters are observed (which leads to different units being observed). Third, there is variation in the treatment assignment across units. Whereas the standard framework for clustering focuses solely on the first two (sampling) sources of uncertainty, our proposed framework allows for all three. How much these three components matter for the variance of the least squares and fixed effects estimators of the average treatment effect depends on (i) the sampling process, (ii) the assignment process, and (iii) the heterogeneity in the treatment effects across clusters. To facilitate the calculation of asymptotic approximations in a range of relevant settings for empirical practice, it is convenient to formally consider a sequence of populations where we can separately control the fraction of units in the population that are sampled and the fraction of clusters in the population that is sampled, as well as the assignment mechanism.
2.1 A Sequence of Populations
We have a sequence of populations indexed by $n$. The $n$-th population has $n$ units, indexed by $i = 1, \ldots, n$. The population is partitioned into $M_n$ clusters. Let $c_i \in \{1, \ldots, M_n\}$ denote the cluster that unit $i$ of population $n$ belongs to. The number of units in cluster $m$ of population $n$ is $n_m$. For each unit, $i$, there are two potential outcomes, $y_i(1)$ and $y_i(0)$, corresponding to treatment and no treatment. Thus the population is characterized by the set of triples $(y_i(1), y_i(0), c_i)$, for units $i = 1, \ldots, n$ and clusters $m = 1, \ldots, M_n$. The object of interest is the population average treatment effect
$$\tau = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i(1) - y_i(0) \bigr).$$
The population average treatment effect by cluster is
$$\tau_m = \frac{1}{n_m} \sum_{i : c_i = m} \bigl( y_i(1) - y_i(0) \bigr).$$
Therefore,
$$\tau = \sum_{m=1}^{M_n} \frac{n_m}{n}\, \tau_m.$$
We assume that the potential outcomes, $y_i(1)$ and $y_i(0)$, are bounded in absolute value, uniformly over $i$ and $n$.
For each unit in the population, we define the stochastic treatment indicator, $W_i \in \{0, 1\}$. The realized outcome for unit $i$ in population $n$ is $Y_i = y_i(W_i) = W_i\, y_i(1) + (1 - W_i)\, y_i(0)$. For a random sample of the population, we observe the triple $(Y_i, W_i, c_i)$. Inclusion in the sample is represented by the random variable $R_i$, which takes value one if unit $i$ belongs to the sample, and value zero if not. We next describe the two components of the stochastic nature of the sample: the sampling process that determines the values of $R_i$, and the assignment process that determines the values of $W_i$.
2.2 The Sampling Process
The sampling process that determines the values of $R_i$ is independent of the potential outcomes and the assignments. It consists of two stages. First, clusters are sampled with cluster sampling probability $q$. Second, units are sampled from the subpopulation consisting of all the sampled clusters, with unit sampling probability equal to $p$. Both $q$ and $p$ may be equal to one, or close to zero. If $q = 1$, we sample all clusters. If $p = 1$, we sample all units from the sampled clusters. If $p = q = 1$, all units in the population are sampled. The standard framework for analyzing clustering focuses on the special case where $q$ is close to zero, so only a small fraction of the clusters in the population are sampled. The case $q = 1$ and $p$ close to zero corresponds to taking a relatively small random sample of units from the population. While this is an important special case, there are also many applications where the sampled clusters comprise a large fraction of the overall set of clusters. We refer to the case of $q = 1$ as random sampling and to the case of $q < 1$ as clustered sampling.
2.3 The Assignment Process
The assignment process that determines the values of $W_i$ also consists of two stages. In the first stage of the assignment process, for cluster $m$ in population $n$, an assignment probability $A_m$ is drawn randomly from a distribution with mean $\mu$, bounded away from zero and one uniformly in $n$, and variance $\sigma^2$, independently for each cluster. The variance $\sigma^2$ is key. If $\sigma^2$ is zero, then $A_m = \mu$ is the same for all $m$, and the treatment is randomly assigned across clusters. We refer to this case as random assignment. For positive values of $\sigma^2$, assignment probabilities depend on the cluster. Because $A_m$ takes values in $[0, 1]$, it follows that $\sigma^2$ is bounded above by $\mu(1-\mu)$ and that the bound is attained when $A_m$ can only take values zero or one, so all units within a cluster have the same value for the treatment. We use the term clustered assignment to refer to the case $\sigma^2 = \mu(1-\mu)$, when there is no within-cluster variation in $W_i$. We use the term partially clustered assignment to refer to the case $0 < \sigma^2 < \mu(1-\mu)$, where assignment depends on the cluster but not all units in the same cluster necessarily have the same value of $W_i$. In the second stage of the assignment process, each unit $i$ in cluster $m$ is assigned to the treatment independently, with cluster-specific probability $A_m$.
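The full data-generating process of Sections 2.2 and 2.3 can be summarized in a short simulation. In this sketch, the Beta distribution for the cluster assignment probabilities is our own convenient choice; any distribution on the unit interval with the stated mean (mu) and variance (sigma2) fits the framework, and all names are ours:

```python
import random

def draw_sample(clusters, q, p, mu, sigma2, seed=0):
    """Simulate one draw of (R_i, W_i) under two-stage sampling
    and two-stage assignment.

    clusters: list of cluster sizes. Treatment status is drawn for every
    population unit; only units with R_i = 1 are observed.
    """
    rng = random.Random(seed)
    if sigma2 > 0:
        # Beta parameters matching the requested mean and variance;
        # requires sigma2 < mu * (1 - mu), the maximum possible variance.
        k = mu * (1 - mu) / sigma2 - 1
        a, b = mu * k, (1 - mu) * k
    units = []
    for m, n_m in enumerate(clusters):
        sampled_cluster = rng.random() < q                      # sampling, stage 1
        A_m = rng.betavariate(a, b) if sigma2 > 0 else mu       # assignment, stage 1
        for _ in range(n_m):
            R = 1 if (sampled_cluster and rng.random() < p) else 0  # sampling, stage 2
            W = 1 if rng.random() < A_m else 0                      # assignment, stage 2
            units.append((m, R, W))
    return units

units = draw_sample([100] * 50, q=0.8, p=0.5, mu=0.4, sigma2=0.1)
```

Setting `sigma2=0` gives random assignment, `sigma2` near `mu * (1 - mu)` approaches clustered assignment, and intermediate values give partially clustered assignment; `q=1` gives random sampling and `q < 1` clustered sampling.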
3 The Least Squares Estimator and its Variance
Let
$$N_1 = \sum_{i=1}^{n} R_i W_i, \qquad N_0 = \sum_{i=1}^{n} R_i (1 - W_i),$$
be the number of treated and untreated units in the sample, respectively; these are random variables. The total sample size is $N = N_0 + N_1$.
We first analyze the OLS estimator of a regression of the outcome $Y_i$ on an intercept and the treatment indicator $W_i$. The OLS estimator (modified so it is well-defined even when $N_1 = 0$ or $N_0 = 0$) is equal to the difference in means:
$$\hat{\tau} = \frac{1}{\max(N_1, 1)} \sum_{i=1}^{n} R_i W_i Y_i - \frac{1}{\max(N_0, 1)} \sum_{i=1}^{n} R_i (1 - W_i) Y_i, \tag{1}$$
where $\max(N_1, 1)$ and $\max(N_0, 1)$ are the maxima of $N_1$ and 1 and of $N_0$ and 1, respectively.
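For concreteness, the guarded difference in means in equation (1) can be written in a few lines (a sketch; the tuple layout is our own):

```python
def tau_hat(data):
    """Difference-in-means estimator of equation (1), guarded so it is
    well-defined even when one arm is empty in the sample.

    data: list of (R, W, Y) tuples for all population units;
    units with R = 0 are not observed and do not contribute.
    """
    s1 = sum(y for r, w, y in data if r == 1 and w == 1)
    s0 = sum(y for r, w, y in data if r == 1 and w == 0)
    n1 = sum(1 for r, w, y in data if r == 1 and w == 1)
    n0 = sum(1 for r, w, y in data if r == 1 and w == 0)
    return s1 / max(n1, 1) - s0 / max(n0, 1)

# two sampled treated units (mean 2), two sampled controls (mean 1),
# and one unsampled unit that is ignored
print(tau_hat([(1, 1, 3.0), (1, 1, 1.0), (1, 0, 2.0), (1, 0, 0.0), (0, 1, 9.0)]))  # → 1.0
```

The `max(·, 1)` guard simply returns the negative (or positive) mean of the nonempty arm when the other arm is empty, rather than dividing by zero.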
We make the following assumptions about the sampling process and the cluster sizes: (i) $q M_n \to \infty$, (ii) $p\, n / M_n$ is bounded away from zero, and (iii) $M_n \max_m n_m / n$ is bounded. The first assumption implies that the expected number of sampled clusters goes to infinity as $n$ increases. The second assumption implies that the average number of observations sampled per cluster, conditional on the cluster being sampled, does not go to zero. The third assumption restricts the imbalance between the number of units across clusters. Notice that assumptions (i) and (ii) imply $p q n \to \infty$, so the sample size becomes larger in expectation as $n$ increases.
3.1 Large-Sample Distribution of the Least Squares Estimator
Our first main result derives the large-sample distribution of $\hat{\tau}$. Under additional regularity conditions stated in the Appendix, $\hat{\tau} - \tau$ is approximately normally distributed with mean zero and variance $V$,
where
| (2) |
The expression for the variance $V$ has multiple terms, which makes its interpretation challenging. We first interpret $V$ in some special cases to highlight the implications of clustered sampling and clustered assignment. In Section 3.3, we compare $V$ to the large-$n$ form of the robust and cluster variance estimators.
For the case of random sampling ($q = 1$) and random assignment ($\sigma^2 = 0$), the variance simplifies to
As we show in Section 3.2 below, the first term in this variance is estimated by the robust variance estimator. The second term is a finite sample correction that is familiar from the literature on randomized experiments [e.g., neyman1923, imbens2015causal, abadie2020sampling]. This finite sample correction vanishes if there is either no heterogeneity in the treatment effects (so $y_i(1) - y_i(0) = \tau$ for all $i$), or if the sample is a small fraction of the population ($p$ close to zero).
Adding clustered sampling, $q < 1$, increases the variance by
which is the same as
This term vanishes if there is no heterogeneity in the average treatment effect across clusters. Although the sample is informative about heterogeneity in cluster average treatment effects, it is not informative about the value of $q$. Information about the need to adjust for clustered sampling ($q < 1$) must come from outside the sample.
Clustered assignment, $\sigma^2 > 0$, adds two terms to the variance,
As we explain in more detail in Section 3.3, the sign of this expression depends on the amount of variation in potential outcomes that can be explained by the clusters. Note that, in contrast to the lack of sample information about the need to adjust for clustered sampling, the sample is potentially informative about the need to account for clustered assignment.
The five terms making up the asymptotic variance can be of different order. The first term is an average of bounded terms, and so under our assumptions shrinks at the rate of the inverse of the expected sample size. The second and third terms are at most of the same order as the first. If the sample is small relative to the population of sampled clusters, the first term dominates the second and third terms. If cluster sizes are bounded as $n$ increases, the fourth and fifth terms in $V$ are also of the same order as the first. If, on the other hand, cluster sizes increase with $n$, these terms can be of higher order and dominate the variance. Whether they do so or not depends on (i) the magnitude of the cluster sizes, (ii) the presence of clustering in sampling, (iii) the presence of clustering in assignment, and (iv) the heterogeneity in potential outcomes.
3.2 The Robust and Cluster Robust Variance Estimators
Let $\hat{\varepsilon}_i = Y_i - \hat{\alpha} - \hat{\tau} W_i$ be the residuals from the regression of $Y_i$ on a constant and $W_i$. Here, $\hat{\alpha}$ is the intercept of the regression and $\hat{\tau}$ is the coefficient on $W_i$ (equal to the expression in (1) with probability approaching one).
There are two common estimators of the variance of $\hat{\tau}$. First, the conventional robust variance estimator (eicker1963, huber1967behavior, white1980heteroskedasticity):
$$\hat{V}^{\mathrm{robust}} = \frac{\sum_{i=1}^{n} R_i\, \hat{\varepsilon}_i^2\, (W_i - \bar{W})^2}{\Bigl( \sum_{i=1}^{n} R_i\, (W_i - \bar{W})^2 \Bigr)^2}, \tag{3}$$
where
$$\bar{W} = \frac{1}{N} \sum_{i=1}^{n} R_i W_i.$$
Let $V^{\mathrm{robust}}$ denote the large-sample analogue of $\hat{V}^{\mathrm{robust}}$. Under regularity conditions (see the Appendix), $\hat{V}^{\mathrm{robust}}$ is close to $V^{\mathrm{robust}}$ in large samples, motivating our focus on the comparison of $V^{\mathrm{robust}}$ and $V$. In general, the difference $V^{\mathrm{robust}} - V$ can be positive or negative, so the robust variance estimator can be invalid in large samples.
The second common variance estimator is the cluster variance [liang1986longitudinal, arellano1987practitioners],
$$\hat{V}^{\mathrm{cluster}} = \frac{\sum_{m=1}^{M_n} \Bigl( \sum_{i : c_i = m} R_i\, \hat{\varepsilon}_i\, (W_i - \bar{W}) \Bigr)^2}{\Bigl( \sum_{i=1}^{n} R_i\, (W_i - \bar{W})^2 \Bigr)^2}. \tag{4}$$
Define $V^{\mathrm{cluster}}$, the large-sample analogue of $\hat{V}^{\mathrm{cluster}}$. Then, $\hat{V}^{\mathrm{cluster}}$ is close to $V^{\mathrm{cluster}}$ in large samples. The difference $V^{\mathrm{cluster}} - V$ is always nonnegative. Therefore, for large $n$, the cluster variance estimator can be conservative but cannot underestimate the variance of $\hat{\tau}$.
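The comparison can be made concrete in code. Below is a minimal implementation of the standard heteroskedasticity-robust and cluster-robust sandwich formulas for a regression of the outcome on a constant and a binary treatment (the function name and example data are ours). In the stylized example, outcomes are driven entirely by cluster effects and treatment is constant within clusters, so the cluster variance is four times the robust variance:

```python
def sandwich_variances(data):
    """Robust (EHW) and cluster (Liang-Zeger) variance estimators for the
    OLS regression of Y on a constant and W. data: list of (g, w, y)."""
    n = len(data)
    wbar = sum(w for _, w, _ in data) / n
    # OLS with a single binary regressor: intercept = control mean,
    # slope = difference in means.
    t = [y for _, w, y in data if w == 1]
    c = [y for _, w, y in data if w == 0]
    alpha = sum(c) / len(c)
    tau = sum(t) / len(t) - alpha
    denom = sum((w - wbar) ** 2 for _, w, _ in data) ** 2
    v_robust, per_cluster = 0.0, {}
    for g, w, y in data:
        score = (y - alpha - tau * w) * (w - wbar)   # residual x regressor
        v_robust += score ** 2                       # sum of squared scores
        per_cluster[g] = per_cluster.get(g, 0.0) + score  # within-cluster sums
    v_cluster = sum(s ** 2 for s in per_cluster.values())
    return v_robust / denom, v_cluster / denom

# Stylized example: 4 clusters of 4 units, outcomes equal to the cluster
# effect, clusters 0 and 1 treated, clusters 2 and 3 control.
effects = {0: 1.0, 1: -1.0, 2: 1.0, 3: -1.0}
data = [(g, 1 if g < 2 else 0, effects[g]) for g in range(4) for _ in range(4)]
v_r, v_c = sandwich_variances(data)
print(v_r, v_c)  # → 0.25 1.0
```

The only difference between the two estimators is where the squaring happens: the robust version squares each unit's score, while the cluster version first sums scores within each cluster, so within-cluster correlation in the scores inflates (or, with negative correlation, deflates) the cluster variance relative to the robust one.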
3.3 Discussion
From the formulas for $V$, $V^{\mathrm{robust}}$, and $V^{\mathrm{cluster}}$, it follows that if the expected number of sampled units per cluster is small enough, then $V^{\mathrm{robust}}$ and $V^{\mathrm{cluster}}$ are approximately equal to $V$. In this case, clustered sampling and clustered assignment do not matter much, because the probability that two sampled units belong to the same cluster is small.
The difference $V^{\mathrm{robust}} - V$ depends on two terms. The first term,
| (5) |
is equal to zero when treatment effects are constant (in which case $y_i(1) - y_i(0) = \tau_m = \tau$ for all $i$ and $m$). If all clusters are sampled, so $q = 1$, and treatment effects are heterogeneous, (5) is positive. When only a fraction of the clusters are sampled, $q < 1$, the sign of (5) depends on the extent to which heterogeneity in treatment effects can be explained by the clusters. If there is no variation in average treatment effects across clusters, the expression in (5) is non-negative. However, when clusters explain much of the variation in treatment effects, the expression in (5) can be negative and very large in magnitude when $q$ is small. The second term of $V^{\mathrm{robust}} - V$ is equal to
| (6) |
This term is equal to zero if there is no clustered assignment, that is, if $\sigma^2 = 0$. If $\sigma^2 > 0$, the sign of (6) depends on how much of the heterogeneity in potential outcomes is explained by the clusters. The expression in (6) is close to zero when there is little heterogeneity in potential outcomes, so the variances of $y_i(0)$ and $y_i(1)$ are close to zero. If there is heterogeneity in potential outcomes but average potential outcomes are nearly constant across clusters, (6) is positive. When the clusters explain enough of the heterogeneity in potential outcomes, (6) can be negative and potentially very large in magnitude because of the factor multiplying the second term of the sum in (6). That is, the robust variance formula can severely underestimate the variance of $\hat{\tau}$.
Clustered standard errors are conservative in general; that is, $V^{\mathrm{cluster}} \geq V$. In particular, the difference $V^{\mathrm{cluster}} - V$ is
which can be rewritten as
| (7) |
When the expected fraction of clusters in the sample, $q$, is small, or when the average treatment effect is nearly constant across clusters, then $V^{\mathrm{cluster}} \approx V$. Aside from these special cases, the leading factor in the formula above indicates that cluster standard errors can be extremely conservative.
4 Two New Variance Estimators
Estimation of the variance of $\hat{\tau}$ is challenging because the different terms in $V$ can be of different orders of magnitude. In this section, we propose two estimators of the variance of $\hat{\tau}$ that allow us to correct the bias of the cluster variance estimator, one analytic, and one resampling-based. As the expression for the bias of the cluster variance in (7) shows, the cluster variance is heavily biased if the fraction $q$ of sampled clusters is large and there is substantial variation in the cluster-specific treatment effects. Although the proposed analytic variance estimator is defined irrespective of the value of $\sigma^2$, for the correction to be effective we need to be able to estimate the cluster-specific treatment effects, and thus we need $\sigma^2$ to be less than its maximum value of $\mu(1-\mu)$, so that there is variation in the treatment assignment within clusters. One of the proposed variance estimators is based on a correction to $\hat{V}^{\mathrm{cluster}}$, and the other is based on resampling methods. An alternative would be to directly estimate the bias term in (7) and subtract it from the cluster variance. A challenge with this approach is that the estimation error for the adjustment term is large, because the order of magnitude of the correction is itself large; this often leads to negative variance estimates, and the approach did not work well in our simulations. We do not report formal results for the variance estimators in the current paper. We demonstrate their performance in the simulations in Section 6. Further refinements may well be possible.
If is close to zero, the proposed variance estimators are close to , which has little bias in that case. If (that is, when is constant within clusters), the proposed resampling variance estimator is not defined. To be effective, both variance estimators rely on estimating the variation in treatment effects across clusters, and therefore require a substantial number of both treated and control observations per cluster. The proposed variance estimators lead to substantial improvements over in cases where has a large upward bias. The downside of the proposed variance estimators is that they can be conservative when there is no need to cluster because there is no heterogeneity in treatment effects, or when there are too few treated and control observations per cluster to estimate the heterogeneity in the treatment effects precisely.
We first consider, in Section 4.1, the case with , so we have random sampling. Next, in Section 4.2, we consider the case with clustered sampling, . In Section 4.3, we propose a bootstrap procedure for estimating the variance. The proposed variance estimators perform very well in the simulation study of Section 6. The derivation of their formal properties is left for future work.
4.1 The Case with All Clusters Observed
First we focus on the case with (all clusters observed), but allowing for general . Let . The first step is to approximate the normalized error of the least squares estimator by a normalized sample average over clusters,
| (8) |
where the terms
are independent across clusters. In the appendix, we show
| (9) |
The expectation of is
with sum over clusters
| (10) |
That is, although the sum of the expectations of over clusters is equal to zero, these expectations are not equal to zero in general for each cluster separately. Because , the first term on the right-hand side of (9) is conservative in expectation relative to the variance of , which explains the conservativeness of .
Because of (10), we can replace the terms in (8) by , where
| and | ||||
Therefore,
| (11) |
It can be shown that and have means equal to zero and are uncorrelated. In addition, and are uncorrelated across clusters. The variance of is
Let be the difference between the sample averages of the outcome for treated and nontreated units in cluster . A direct estimator of the variance of is
| (12) |
In practice, the estimator in (12) is biased because of the correlations between the estimation errors of its components. We apply sample splitting to address this bias. We first split the sample randomly into two subsamples. Let be the indicator that unit belongs to the second subsample, and let be the mean of . Using the subsample with , we obtain estimates , , and of , , and , respectively. Next, for observations with , we calculate the residuals . Finally, we estimate the normalized variance for the case with as
| (13) |
where is the size of the sample in cluster . For clusters with no variation in the treatment variable, we replace in (13) with . For clusters with no variation in the treatment variable within a particular subsample, we replace in (13) with . We derive the form of the CCV estimator in the appendix. To improve the precision of , we re-estimate it multiple times with new sample splits (new values of ) and average the corresponding variance estimators. In the simulations of Section 6, we re-estimate the variance estimator four times and use sample splits with, in expectation, an equal number of units in each subsample, so .
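The split-and-average step can be sketched in a few lines of Python. The `variance_fn` interface and the function names below are illustrative conventions for exposition, not part of the paper.

```python
import numpy as np

def average_over_splits(variance_fn, data, n_splits=4, rng=None):
    """Average a split-based variance estimator over several random
    sample splits, as the text does for the CCV estimator.

    variance_fn(data, z) is assumed to return a variance estimate given
    a 0/1 split indicator z (z = 1 marks the estimation subsample)."""
    rng = np.random.default_rng(rng)
    estimates = []
    for _ in range(n_splits):
        # each unit falls in either subsample with probability one half
        z = rng.integers(0, 2, size=len(data))
        estimates.append(variance_fn(data, z))
    return float(np.mean(estimates))
```

Averaging over several independent splits reduces the extra noise introduced by any single random split, at the cost of recomputing the estimator a few times.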
4.2 The Case When Not All Clusters Are Sampled
To motivate the modification of the variance estimator for the case, notice that
where denotes the value of the true variance evaluated at . That is, the variance for the general case is a convex combination of the true variance at and the cluster variance,
Let be the ratio between the number of sampled clusters and the total number of clusters in the population. The proposed variance estimator, , is a convex combination of and with weights and ,
| (14) |
Computation of requires knowledge of , the total number of clusters in the population.
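Given estimates of the two components, the convex combination in (14) is straightforward to compute. The function and argument names below are illustrative, and we assume (as the motivation above suggests) that the fraction of sampled clusters weights the component estimated at full cluster sampling.

```python
def ccv_variance(v_ccv_full, v_cluster, m_sample, m_population):
    """Convex combination of two variance estimates, as in equation (14).

    v_ccv_full   : CCV variance estimate for the all-clusters-sampled case
    v_cluster    : conventional cluster variance estimate
    m_sample     : number of sampled clusters
    m_population : total number of clusters in the population (must be known)

    Assumes the sampled-cluster fraction weights the full-sampling component.
    """
    frac = m_sample / m_population
    return frac * v_ccv_full + (1.0 - frac) * v_cluster
```

When all clusters are sampled the weight on the cluster variance is zero, and when only a negligible fraction is sampled the estimator reverts to the conventional cluster variance.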
4.3 A Bootstrap Variance Estimator
In the previous sections, we have discussed an analytic variance estimator. Here we suggest a resampling-based variance estimator, initially for the case with . Like the causal bootstrap in imbens2021causal, the proposed bootstrap procedure takes into account the causal nature of the estimand and creates bootstrap samples where units (in this case clusters) have different assignments and assignment probabilities than they have in the original sample. It differs from earlier bootstrap variance estimators for clustered settings [e.g., cameron2015practitioner, menzel2021bootstrap] in that it allows for the possibility that a large fraction of clusters are observed.
The specific resampling procedure, which we call the two-stage cluster bootstrap (TSCB), consists of two stages. For each of the clusters, let be the cluster-level sample size and the cluster-level fraction of treated units. In the first stage of the bootstrap procedure, for each cluster we draw with replacement from the empirical distribution of the cluster-level fractions of treated units, that is, with probability from the set . In the second stage, we draw units with replacement from the set of treated units in cluster and units with replacement from the set of untreated units in cluster . In order for this to be well-defined, we need all the to be strictly between zero and one. We do this for all clusters to create the bootstrap sample, and calculate the bootstrap standard errors as the standard deviation of the treatment effect estimates across bootstrap iterations.
Next, consider the case with . In this case, we need to take into account the fact that we observe only a fraction of the clusters in the population. We follow the approach proposed in chao1985bootstrap. Suppose , so we observe half the clusters in the population. The bootstrap procedure first creates a pseudo-population consisting of the original population of clusters plus one additional replica of each cluster. Then, to obtain a bootstrap sample, we sample randomly, without replacement, from the clusters in this pseudo-population. Given the clusters in the bootstrap sample, we proceed as before, and ultimately calculate the bootstrap variance as the variance of the estimator over the bootstrap samples. chao1985bootstrap provide details and extensions for the case where is not an integer.
The algorithm for the TSCB is summarized here.
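As a rough illustration of the two stages for the case with all clusters sampled, consider the following sketch. The rounding of the treated count and the clipping that keeps both treatment arms non-empty are our own simplifications, not details from the paper.

```python
import numpy as np

def tscb_draw(y, w, cluster, rng):
    """One draw of the two-stage cluster bootstrap (TSCB), sketched for
    the case with all clusters sampled. Requires every cluster to contain
    both treated and untreated units (treated fractions strictly in (0,1)).

    Stage 1: each cluster draws a treated fraction, with replacement, from
             the empirical distribution of cluster-level treated fractions.
    Stage 2: treated and untreated units are resampled with replacement
             within the cluster, in the drawn proportions."""
    clusters = np.unique(cluster)
    fractions = np.array([w[cluster == c].mean() for c in clusters])
    y_b, w_b = [], []
    for c in clusters:
        idx = np.flatnonzero(cluster == c)
        n_c = len(idx)
        a_c = rng.choice(fractions)            # stage 1
        n_t = int(round(a_c * n_c))            # our rounding convention
        n_t = min(max(n_t, 1), n_c - 1)        # keep both arms non-empty
        treated = idx[w[idx] == 1]
        control = idx[w[idx] == 0]
        y_b.extend(y[rng.choice(treated, size=n_t)])        # stage 2
        w_b.extend([1] * n_t)
        y_b.extend(y[rng.choice(control, size=n_c - n_t)])
        w_b.extend([0] * (n_c - n_t))
    return np.array(y_b), np.array(w_b)

def tscb_se(y, w, cluster, n_boot=100, seed=0):
    """Bootstrap standard error of the difference in means over TSCB draws."""
    rng = np.random.default_rng(seed)
    taus = []
    for _ in range(n_boot):
        y_b, w_b = tscb_draw(y, w, cluster, rng)
        taus.append(y_b[w_b == 1].mean() - y_b[w_b == 0].mean())
    return float(np.std(taus))
```

The first stage is what distinguishes the TSCB from a conventional within-cluster bootstrap: redrawing the treated fractions injects the assignment-design variation into the bootstrap distribution.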
5 The Fixed Effect Estimator
In this section, we report results for the fixed effect estimator often used in empirical research in economics. arellano1987practitioners, Bertrand2004did, cameron2015practitioner and mackinnon2021cluster have pointed out that cluster adjustments may still be necessary in fixed effects regressions. However, a view of clustering based on models with cluster-specific variance components creates ambiguity about the role of clustered standard errors for estimators with cluster fixed effects, which are specifically aimed at absorbing cluster-level variation.
We first characterize the fixed effect estimator and derive its large sample distribution. Then, we discuss the properties of the two conventional variance estimators, the robust and the cluster-robust variance estimators. As in the least squares case, we find that the robust standard errors may be too small and the cluster standard errors may be unnecessarily large, especially when the number of observations per cluster is large. We propose CCV and TSCB variance estimators. The CCV estimator for fixed effects, derived in section A.4, has a different form than the one for least squares.
The fixed effect estimator is based on a regression of the outcome on the treatment indicator and indicators for each of the clusters in the sample. It can be written as the least squares estimate for a regression of the outcome on the treatment, with both variables measured in deviation from cluster means,
| (15) |
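A minimal sketch of the within transformation behind (15): demean the outcome and the treatment by cluster, then take the ratio of the cross product to the treatment sum of squares.

```python
import numpy as np

def fixed_effect_estimate(y, w, cluster):
    """Within (fixed effect) estimator: the least squares slope from a
    regression of the outcome on the treatment, with both variables
    measured in deviation from cluster means, as in equation (15)."""
    y_dm = y.astype(float)
    w_dm = w.astype(float)
    for c in np.unique(cluster):
        m = cluster == c
        y_dm[m] -= y_dm[m].mean()   # deviation from cluster mean outcome
        w_dm[m] -= w_dm[m].mean()   # deviation from cluster mean treatment
    return float(w_dm @ y_dm / (w_dm @ w_dm))
```

The ratio is undefined when the treatment shows no within-cluster variation, mirroring the requirement noted in the text that some clusters contain both treated and untreated units.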
As in Section 3, we assume that potential outcomes are bounded, , and . In addition, we assume (i) , and (ii) the supports of the cluster probabilities, , are bounded away from zero and one (uniformly in and ). Assumption (i) restricts the focus of our analysis in this section to settings where the expected number of sampled clusters is small relative to the expected number of sampled observations per sampled cluster. Together with the previous assumptions, assumption (i) implies , , and . This last result, along with assumption (ii), ensures that in (15) is well-defined with probability approaching one.
Let . For an observation, , with , we define the within-cluster residuals and . Let
| (16) |
where
Under additional regularity conditions, which are described in the Appendix, we obtain the large sample distribution of the fixed effects estimator,
| (17) |
Let , where , . The robust estimator of the variance of is
| (18) |
Now let,
with
Notice that all terms of are bounded. In the appendix, we show that
The cluster variance estimator for fixed effects is
| (19) |
Let,
with
We obtain in the appendix,
As in the least squares case, the robust variance can underestimate the true variance, and the cluster variance is generally too large. Our proposed variance estimator is a convex combination of and , with the weights selected to correct the bias of the cluster variance estimator as increases (see appendix for details).
| (20) |
where the estimated weight for the cluster variance is
where is an indicator that takes value one if cluster of population is sampled, and is the total number of sampled clusters. The second factor in the second term estimates, approximately (that is, ignoring the variance of conditional on ), the variance of divided by its second moment, so that
If there is no variation in within any of the clusters, the fixed effect estimator is not defined, and neither is this variance estimator. In all other cases, the variance estimator is well-defined.
We also consider a bootstrap standard error, based on the same resampling procedure described in Section 4.3.
6 Simulations
We next report simulation results that illustrate the performance of the proposed variance estimators relative to existing alternatives. To operate in an empirically relevant setting, we create an artificial population based on the Census data briefly described in the introduction, which contains information on log earnings, an indicator for college attendance, and an indicator for state of residence for 2,632,838 individuals.
For each individual in this population of 2,632,838 individuals, we define clusters by state of residence (the 50 states plus Washington, DC, and Puerto Rico), for a total of 52 clusters. We assign potential outcomes as and , so treatment effects are constant within clusters. We then repeatedly create samples from this population. Creating a sample requires fixing , , and the distribution of , and then drawing from the implied distribution of and to generate outcomes for all sampled units. In the baseline design, we set , so we sample all clusters and all individuals in the population. For the assignment mechanism in the baseline design, we convert cluster means of the treatment variable into log-odds, . Let be the average and be the sample standard deviation of . We then draw for cluster from a normal distribution with expected value and standard deviation . Given the cluster assignment probability , we assign the treatment in cluster by drawing from a binomial distribution with parameter .
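The assignment mechanism above can be sketched as follows. The `scale` argument, which rescales the empirical log-odds standard deviation to accommodate designs with more or less assignment variation, is our own parameterization and not necessarily how the designs in the tables were generated.

```python
import numpy as np

def draw_assignment_probs(observed_cluster_means, scale, rng):
    """Sketch of the baseline assignment design: convert observed
    cluster-level treatment rates to log-odds, then draw each cluster's
    assignment probability from a normal distribution on the log-odds
    scale, matching the empirical mean and a rescaled empirical sd."""
    p = np.asarray(observed_cluster_means, dtype=float)
    log_odds = np.log(p / (1.0 - p))
    mu = log_odds.mean()
    sd = log_odds.std(ddof=1)
    draws = rng.normal(mu, scale * sd, size=len(p))
    return 1.0 / (1.0 + np.exp(-draws))   # logistic map back to (0, 1)
```

Each sampled cluster then receives Bernoulli treatment draws with its drawn probability, which is what induces the within-cluster correlation in assignment.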
We calculate the standard deviation of the least squares and fixed effect estimators, normalized by the square root of the sample size, , across 10,000 samples drawn according to the procedure outlined above. This is the benchmark against which we compare the various estimates of standard errors. For the least squares and the fixed effects estimators, respectively, we first calculate the (infeasible) asymptotic standard errors and to benchmark the performance of the feasible variance estimators. Next, we calculate the averages across 10,000 simulations of the robust, cluster, CCV, and TSCB standard errors, using 100 bootstrap replications in each simulation. Table 6 reports the results. Table LABEL:table:coverage_rates reports coverage rates for 95 percent confidence intervals. In the design columns of the two tables, is the standard deviation of the cluster average treatment effect.
| normalized standard error | ||||||||
|---|---|---|---|---|---|---|---|---|
| | | | robust | cluster | CCV | TSCB | ||
| Baseline design: , , , | ||||||||
| OLS | 5.91 | 5.90 | 1.90 | 44.86 | 6.32 | 5.80 | ||
| FE | 2.34 | 2.32 | 1.90 | 44.63 | 2.31 | 2.29 | ||
| Second Design: , , , | ||||||||
| OLS | 2.61 | 2.59 | 1.90 | 14.28 | 3.78 | 2.60 | ||
| FE | 1.95 | 1.95 | 1.90 | 14.21 | 1.95 | 1.94 | ||
| Third Design: , , , | ||||||||
| OLS | 14.50 | 14.17 | 1.98 | 56.46 | 13.70 | 14.33 | ||
| FE | 12.14 | 11.89 | 2.13 | 56.79 | 11.61 | 12.07 | ||
| Fourth design: , , , | ||||||||
| OLS | 9.39 | 9.39 | 1.90 | 8.20 | 9.19 | 9.37 | ||
| FE | 2.04 | 2.04 | 2.04 | 1.97 | 2.04 | 2.09 | ||
| Fifth design: , , , | ||||||||
| OLS | 1.95 | 1.97 | 1.97 | 56.42 | 4.53 | 2.04 | ||
| FE | 1.91 | 1.94 | 1.94 | 56.42 | 1.96 | 1.90 | ||
-
•
Notes: s.d. is the standard deviation of the estimators over the simulations, multiplied by the square root of the sample size. is the square root of the asymptotic variance in equation (3.1). is the square root of the asymptotic variance of the fixed effect estimator in (16). The remaining four columns report average values of robust, cluster, CCV, and TSCB standard errors across simulations (multiplied by ). and are the unit and cluster sampling probabilities, respectively. is the standard deviation of the cluster average treatment effect. is the standard deviation across clusters of the treatment assignment probabilities.
For the baseline design, the normalized standard deviation of the least squares estimator is 5.91. This is well approximated by the asymptotic standard error, 5.90. The robust standard error averages 1.90 over the simulations, less than one-third of the normalized standard deviation of the estimator. The cluster standard error is far too large, on average 44.86, more than seven times the value of the normalized standard deviation. CCV improves considerably over both robust and cluster: the average CCV standard error is 6.32, about 7 percent higher than the normalized standard deviation. The TSCB standard error is the most accurate, on average equal to 5.80. For the fixed effect estimator, the asymptotic standard error is again accurate. The robust standard error is about 16 percent too small, leading to a coverage rate of 0.89 for the nominal 95 percent confidence interval in Table LABEL:table:coverage_rates. The cluster standard error is too large by a factor of about 20. The CCV and TSCB standard errors closely approximate the normalized standard error.
Appendix
A.1 Setting and notation
We have a sequence of populations indexed by . The -th population has units, indexed by . The population is partitioned into strata or clusters. Let denote the stratum that unit of population belongs to. The number of units in cluster of population is . For each unit, , there are two potential outcomes, and , corresponding to treatment and no treatment. The parameter of interest is the population average treatment effect
The population treatment effect by cluster is
Therefore,
We will assume that potential outcomes, and , are bounded in absolute value, uniformly for all .
We next describe the two components of the stochastic nature of the sample. There is a stochastic binary treatment for each unit in each population, . The realized outcome for unit in population is . For a random sample of the population, we observe the triple . Inclusion in the sample is represented by the random variable , which takes value one if unit belongs to the sample, and value zero if not.
The sampling process that determines the values of is independent of the potential outcomes and the assignments. It consists of two stages. First, clusters are sampled with cluster sampling probability . Second, units are sampled from the subpopulation consisting of all the sampled clusters, with unit sampling probability equal to . Both and may be equal to one, or close to zero. If , we sample all clusters. If , we sample all units from the sampled clusters. If , all units in the population are sampled.
The assignment process that determines the values of also consists of two stages. In the first stage, for cluster in population , an assignment probability is drawn randomly from a distribution with mean , bounded away from zero and one uniformly in , and variance , independently for each cluster. The variance is key. If it is zero, we have random assignment across clusters. For positive values of we have correlated assignment within the clusters. Because , it follows that is bounded above by and that the bound is attained when can only take values zero or one (so all units within a cluster have the same values for the treatment). In the second stage, each unit in cluster is assigned to the treatment independently, with cluster-specific probability .
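The two-stage sampling and assignment processes can be sketched jointly. Here `draw_prob` stands in for the cluster-level distribution of assignment probabilities (with the mean and variance described above), and all names are illustrative.

```python
import numpy as np

def sample_and_assign(cluster_sizes, q, p, draw_prob, rng):
    """Sketch of the two-stage sampling and assignment processes.

    Sampling:   clusters are sampled with probability q, then units
                within sampled clusters with probability p.
    Assignment: each sampled cluster draws an assignment probability
                via draw_prob(rng), then units receive independent
                Bernoulli treatment draws with that probability.
    Returns a list of (cluster_index, treatment) records."""
    records = []
    for c, n_c in enumerate(cluster_sizes):
        if rng.random() >= q:            # sampling stage 1: cluster skipped
            continue
        a_c = draw_prob(rng)             # assignment stage 1
        for _ in range(n_c):
            if rng.random() >= p:        # sampling stage 2: unit skipped
                continue
            w = int(rng.random() < a_c)  # assignment stage 2
            records.append((c, w))
    return records
```

Setting `q = 1` and `p = 1` recovers the case in which the whole population is observed, while a degenerate `draw_prob` (zero variance) corresponds to random assignment across clusters.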
A.2 Base case: Difference in means
Let
be the number of treated and untreated units in the sample, respectively. The total sample size is . We consider the simple difference of means between treated and non-treated, which is obtained as the coefficient on the treatment indicator in a regression of the outcome on a constant and the treatment,
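The difference-in-means estimator is immediate to compute, and coincides with the least squares coefficient described above:

```python
import numpy as np

def diff_in_means(y, w):
    """Difference in means between treated and untreated units; equals
    the coefficient on the treatment indicator in a regression of the
    outcome on a constant and the treatment."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w)
    return float(y[w == 1].mean() - y[w == 0].mean())
```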
We make the following assumptions about the sampling process and the cluster sizes: (i) , (ii) , and (iii) . The first assumption implies that the expected number of sampled clusters goes to infinity as increases. The second assumption implies that the average number of observations sampled per cluster, conditional on the cluster being sampled, does not go to zero. The third assumption restricts the imbalance between the number of units across clusters. Notice that assumptions (i) and (ii) imply , so the sample size becomes larger in expectation as increases.
A.2.1 Large sample distribution
Let and , , and . Notice that,
This implies
where
, , and . We will first derive the large sample distribution of
where
and
Notice that . Moreover, notice that the terms are independent across clusters, . In addition,
and
We obtain:
Therefore,
Let , then
Alternatively, we can write this expression as
The sum of the first three terms is minimized for and , in which case this sum is equal to zero. Therefore,
| (A.1) |
We will assume that , so either sampling or assignment or both are correlated within cluster. (We study the case and separately below.) In addition, assume (i) and
| (A.2) |
or (ii) and
| (A.3) |
Equation (A.2) would be violated if, as increases, there is no variation in average treatment effects across clusters. Equation (A.3) would be violated if as increases there is no variation in average potential outcomes across clusters. If equations (A.2) and (A.3) hold, is bounded below by a term of order at least . Recall our assumption, , so the average number of observations sampled per cluster, conditional on the cluster being sampled, does not go to zero. Then,
To obtain a CLT, we will check Lyapunov’s condition,
for some . Because potential outcomes are uniformly bounded and is uniformly bounded away from zero, we obtain
where is some generic positive constant, whose value may change across equations. Consider , and let
for . (The second and third terms on the left-hand side of the last equation only appear when and , respectively.) As a result,
Because , for large enough we obtain,
and the same bound applies for . Notice that
Now, Hölder’s inequality implies that
| (A.4) |
is sufficient for the Lyapunov condition to hold. Because is bounded asymptotically, we obtain,
and so the Lyapunov condition holds. As a result, we obtain
We will next prove that both and are .
Therefore,
Because , we obtain . As a result, is .
Let . Consider large enough, so is bounded away from zero, making well-defined. Notice that and
This implies . Analogous calculations yield . For large enough , if and only if , which implies . It follows that, for large enough ,
and . Using analogous calculations, we obtain . As a result,
Therefore,
Using and , it is easy to show , which implies
We will next consider the case of and , where no clustering is required. Consider
and
Redefine now . Then,
Notice that is minimized for , in which case
Therefore, the assumption
is enough for . Notice now that
and the same bound holds for . Therefore, for the Lyapunov condition to hold, it is enough that
or . That is, assumptions (i)-(iii), which we used for the clustered case, are replaced by .
A.2.2 Estimation of the variance
Let be the residuals from the regression of on a constant and . Here, is the coefficient on the constant regressor equal to one, and is the coefficient on . We have already shown . The same is true for (e.g., apply the proof for after replacing each with a zero). Define , where
Also, let
and . Then, the cluster estimator of the variance of is
Notice that
In addition,
Therefore, under conditions (i)-(iii), we obtain
Analogous calculations yield . Therefore,
and
Now, let . Notice that
Define , where
We will show
Notice that
Therefore,
The same expression holds for the off-diagonal elements of . For , the expression holds once we replace each with a one. Let be the Frobenius norm of a matrix. Then,
We will prove that the right-hand side of the previous equation converges to zero in probability. We will factorize each term into an expression that is bounded in probability and one that converges to zero in .
For the first term, notice that
For the second term, using the fact that is greater than or equal to times a term whose limit inferior is bounded away from zero, we obtain
As a result, we obtain
Notice that
Therefore, to show that the left-hand side of the last equation is , it remains only to show that is . We will prove this next. Notice that
Therefore,
Then,
We, therefore, obtain,
Because , we obtain
Recall that . Notice that
Let
Then,
Alternatively, we can write
We will next show that
Given that is bounded away from zero, by the weak law of large numbers for arrays, it is enough to show
Applying the multinomial theorem and the fact that all moments of as well as all potential outcomes are bounded, we obtain:
Now, using , , and we obtain
As a result,
The robust (sandwich) estimator of the variance of is given by
where
We will derive the limit of . Let
Because potential outcomes (and ) are bounded, we obtain
Because the limsup of the expectation of the first factor (which is non-negative) is bounded and the second factor converges to zero in probability as proved above, we obtain
Notice that
Again, the limsup of the expectation of the right-hand side of this equation is non-negative and bounded. As a result, we obtain .
Notice that
Finally, notice that
Therefore, by the weak law of large numbers for arrays, we obtain
where
A.3 Fixed effects
A.3.1 Large sample distribution
Let
and
| (A.5) |
where
Notice that we need for this estimator to be well-defined in large samples (otherwise, the denominator in the formula for could be equal to zero). Although it is not strictly necessary, because it entails little loss of generality and simplifies the exposition, we will assume that the supports of the cluster probabilities, , are bounded away from zero and one (uniformly in and ). In finite samples, we assign to the cases in which the denominator of in equation (A.5) is equal to zero. Notice that
Let
, and . It follows that
Now, . Then,
Let
| (A.6) |
where, as before, we make if the denominator on the right-hand side of (A.6) is equal to zero. Now, , where
and
Notice that outcomes enter the term only through the intra-cluster errors, and . In contrast, the term depends on outcomes only through inter-cluster variability in treatment effects, . The numerator in the expression for in the last displayed equation does not have mean zero in general, and this will be reflected in a bias term, , which we define next. Let,
and
Then, , where
| and | ||||
The terms and depend on the within-cluster errors and . The terms and depend on the inter-cluster errors . and replace with , while and correct for the difference, .
It can be seen (in intermediate calculations below) that
and
These two expectations are subtracted in and , so and have mean zero. Doing so for does not require adjustments elsewhere. Because
the terms do not change the sum . In contrast, demeaning creates the bias term . If the size of the clusters does not vary across clusters, then is equal to zero. More generally, . Therefore, if
| (A.7) |
(that is, if the expected number of sampled clusters is small relative to the expected number of sampled observations per sampled cluster) then converges to zero. As a result, converges in probability to zero, because, as we will show later, converges in probability to , which is bounded away from zero. In our large sample analysis, we will assume that the expected number of sampled clusters grows to infinity, . Then, equation (A.7) implies that the expected number of observations per sampled cluster goes to infinity, . Notice also that .
We summarize now the assumptions we made thus far. We first assumed that the supports of the cluster probabilities, , are bounded away from zero and one (uniformly in and ), and that potential outcomes are bounded. Moreover, we assumed and . These imply and . We will add the assumption that the ratio between maximum and minimum cluster size is bounded, . This assumption implies and .
We will now study the behavior of . Notice that
In addition,
The weak law of large numbers for arrays implies
Because and , we obtain
We now turn our attention to . We will first calculate the variance of . Let be a binary variable that takes value one if cluster in population is sampled, and zero otherwise. Notice that
and
Consider now
and
It holds that and . Now, notice that
| and | ||||
Therefore,
and
| (A.8) |
We will next show that the terms do not matter for the asymptotic distribution of . Notice that, because the cluster sum of is equal to zero, we obtain and, therefore,
Moreover
In addition, (see intermediate calculations). Therefore,
Now, because errors are bounded, we obtain
| (A.9) |
Because , the weak law of large numbers for arrays, implies,
with the analogous result involving the errors . It follows that
Consider now . Notice that
and
Therefore,
and
Next, we calculate the variance of . Using results on the moments of a Binomial distribution, we obtain, for ,
Therefore,
where
It follows that,
Therefore,
We will now study the covariance between and . Using results on the moments of a Binomial distribution, we obtain, for ,
Therefore,
In addition,
As a result,
Notice that . In addition, and
Therefore,
Next, we will study the remaining covariances between , , , and . Because the intra-cluster errors and sum to zero, it can easily be seen that . It can also be seen that the inter-cluster sums of covariances between and any of the other terms go to zero. To prove this for the covariance with , we have
The same argument and result applies to and . Putting all the pieces together, we obtain
where
Collecting terms with identical factors, we obtain
The first three terms in the expression above depend on intra-cluster heterogeneity in potential outcomes and treatment effects. The last two terms depend on inter-cluster variation in average treatment effects.
A more compact expression for is
| (A.10) |
Notice that the first four terms in (A.3.1) are bounded, and that
Assume that
| (A.11) |
and
| (A.12) |
The last term in equation (A.3.1) is greater than
which converges to infinity because . That is, the last term dominates the variance in large samples provided that (A.11) and (A.12) hold.
We will now derive the large sample distribution of . To show that Lyapunov’s condition holds for , notice that
where the last term inside the absolute value comes from the bias correction. Notice that,
From the formula of the third moment of a binomial random variable, we obtain
as . Now,
Similar calculations deliver the analogous result for the term involving , and proving the result for the bias term is straightforward. Therefore, we obtain
By the Central Limit Theorem for arrays, this implies
Let . Then,
As a result,
A.3.2 Estimation of the variance
Let
| and | ||||
Let
Then,
where
| and, as before, | |||
Let , where , , and is the within estimator of . Let , where
Also, let
Then, the cluster estimator of the variance of is
We know already that
with bounded away from zero. To establish convergence of , first notice that, for , we have
For and , let
and let for and . Then, for and , we have
Then,
Using the formula for the second moment of a binomial distribution and , we obtain,
From the formula of the sum of the first two moments of a binomial distribution, we obtain
Therefore,
Now, notice that
Equation (A.9) (and the analogous result for the sum involving terms with ), implies
As a result, it is enough to establish convergence of , where
We will next show that
| (A.13) |
where
Let
Using the result in equation (A.3.1) and results on the moments of the binomial distribution (see intermediate calculations in section A.7), we obtain
Therefore, to show that equation (A.13) holds, we will show
| (A.14) |
Let
and
Let
Then,
Therefore, because and , we obtain
| (A.15) |
Using the same argument, we obtain
| (A.16) |
where
Notice that equations (A.3.2) and (A.16) imply
and
Notice that the last two equations hold even if is bounded (e.g., when for all and ), as long as is bounded away from zero in large samples. In section A.3.3, we derive conditions under which is bounded away from zero in large samples even if for all and . Now, let
Recall that, under the conditions in (A.11) and (A.12), and is bounded for large and, therefore, is bounded for large . Then (see intermediate calculations at the end of this document), for large ,
Now, Hölder’s inequality implies that equation (A.14) holds (see intermediate calculations).
Now let,
We obtain,
We will next establish the analogous result for the heteroskedasticity-robust variance estimator. Let
Then, the heteroskedasticity-robust estimator of the variance of is
As we have established before,
For and , let
and let for and . Then, for and , we have
and
| (A.17) |
Focusing on the part of the right hand side of last equation that depends on the first term of , we obtain
We now focus on the part of the right-hand side of equation (A.17) that depends on the second term of ,
Using the formula for the variance of a sample mean under sampling without replacement [e.g., in the supplement of abadie2020sampling], we obtain for ,
| (A.18) |
where
Because is bounded, so is the right-hand side of equation (A.3.2). As a result
An analogous derivation applies to the part of the right-hand side of equation (A.17) that depends on the third term of . (Notice that and that is equal to minus the difference between and the analogous difference for the treated.)
Therefore, we will study the behavior of
| (A.19) |
First, notice that
| (A.20) |
The inside of the first square root in equation (A.3.2) is bounded by a constant times
which converges in probability to one. The expectation of the inside of the second square root in equation (A.3.2) is
As a result, the right-hand side of equation (A.3.2) converges to zero in probability. The derivation with replacing in equation (A.3.2) is analogous. Now, notice that
Because the first factor of the expression above is bounded, we obtain
| (A.21) |
Now, the right-hand side of equation (A.3.2) converges to zero in probability by the same argument as for equation (A.3.2). Cauchy-Schwarz inequality implies,
where
(A.22)
for and , and for . Therefore, we will study the behavior of
We know,
and
Now, notice that
which implies
and
Notice now that
Notice also that expectations of the sums of products of the terms on the right-hand side of equation (A.22) are equal to zero. Then,
where
Now let,
We obtain,
A.3.3 Large results for the fixed effects case under homogeneous average treatment effects across clusters
We will now study Lyapounov’s condition for the case for all and , so
Notice that
Therefore,
is sufficient for (even if condition (A.11) does not hold). Given our assumption that the supports of the cluster probabilities, , are bounded away from zero and one (uniformly in and ), then
(A.23)
is sufficient for . Assume that (A.23) holds, so . We now obtain,
and
The first equality holds because the terms sum to zero within clusters. The second equality holds because, if , with , then and are independent conditional on , and . Notice that
which also implies . As a result,
A.4 Derivations of the variance estimators
In this section, we derive the adjustments in the CCV variance. (We do this under the assumption that the are independent. In our simulations we actually use a slightly different sampling scheme for the where the average is identical and fixed in each cluster.) To derive the CCV variance of the least squares estimator, consider first a variance estimator of the form
We aim, however, to design an estimator based on a subsample consisting of units with , where is i.i.d. binary with and independent of . First, notice that
and
Therefore,
and
Adding the last two equations,
(A.24)
The first term of the CCV variance estimator for least squares is based on the sample counterpart of the right-hand side of equation (A.24), with in the role of .
To derive the CCV variance estimator for the fixed effect case, consider
and let . This transformation is designed to reproduce the terms in with factor
These terms dominate as increases. It also reproduces several lower order terms.
Notice that
Then,
For , we obtain,
(A.25)
The difference is non-negative and of smaller order than . Therefore, (even if is bounded away from zero). The first term on the right-hand side of (A.25) could be estimated to further correct the difference between the CCV estimator and the variance of .
A.5 Limit results
Let be an infinite array of random variables, with rows indexed by , and the columns of the -th row indexed by . Let
and .
A Weak Law of Large Numbers for Arrays: For each , suppose that are independent and have finite second moments. In addition, let be a sequence of positive constants such that
Then,
Proof: By Chebyshev’s inequality, for any
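In generic notation (the symbols below are illustrative rather than the paper’s own), the Chebyshev step takes the following form: with $S_n=\sum_{i=1}^{k_n} X_{n,i}$ and positive constants $a_n$ satisfying $a_n^{-2}\sum_{i=1}^{k_n}\mathrm{var}(X_{n,i})\to 0$, we have for any $\varepsilon>0$,
\[
\Pr\Bigl(\bigl|S_n-E[S_n]\bigr|\ge \varepsilon\, a_n\Bigr)
\,\le\, \frac{1}{\varepsilon^2 a_n^2}\sum_{i=1}^{k_n}\mathrm{var}(X_{n,i})
\,\longrightarrow\, 0,
\]
so that $(S_n-E[S_n])/a_n$ converges to zero in probability.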
A Central Limit Theorem for Arrays: For each , suppose that are independent, with zero means, , and finite variances, , for . Let
Assume also that Lyapounov’s condition holds,
for some . Then,
Proof: See Billingsley, Chapter 27.
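For completeness, Lyapounov’s condition for a triangular array of independent, mean-zero variables reads, in generic notation: for some $\delta>0$,
\[
\lim_{n\to\infty}\,\frac{1}{s_n^{2+\delta}}\sum_{i=1}^{k_n} E\bigl|X_{n,i}\bigr|^{2+\delta} \,=\, 0,
\qquad
s_n^2 \,=\, \sum_{i=1}^{k_n}\mathrm{var}(X_{n,i}),
\]
in which case $s_n^{-1}\sum_{i=1}^{k_n} X_{n,i}$ converges in distribution to a standard normal random variable.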
A.6 Intermediate calculations for Section A.2
The calculation of uses the following results.
and
Similarly,
Notice also that
and
The following bounds are useful to prove Lyapounov’s condition.
Let be a binary indicator that takes value one if cluster of population is sampled.
Other useful intermediate calculations.
For the moments of treatment indicators, notice that , and . In addition,
Similarly, . Therefore, and . In addition,
Similarly,
and
, . Moreover,
Recall that . Therefore, . Also,
Similarly,
Therefore,
and
In addition,
A.7 Intermediate calculations for Section A.3
This implies
Therefore,
For ,
Therefore,
For
Because , we obtain
which implies
Therefore,
Conditional on and , the variable has a binomial distribution with parameters . Then, using the formulas for the moments of a binomial distribution, we find that for any integer , such that ,
where and are uniformly bounded in . Therefore,
It follows that
Therefore,
Notice that,
Therefore,
uniformly in .
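The binomial moment formulas invoked in this step are standard: if $X\sim\mathrm{Binomial}(n,p)$, then for any integer $k\ge 1$ with $k\le n$ the factorial moments satisfy
\[
E\bigl[X(X-1)\cdots(X-k+1)\bigr] \,=\, n(n-1)\cdots(n-k+1)\,p^{k},
\]
so in particular $E[X]=np$ and $\mathrm{var}(X)=np(1-p)$. Raw moments follow via Stirling numbers of the second kind, $E[X^k]=\sum_{j=0}^{k} S(k,j)\, n(n-1)\cdots(n-j+1)\, p^{j}$, whose coefficients are bounded uniformly in $n$.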
Suppose . Let and . Now suppose,
and
Using the binomial theorem and Hölder’s inequality, we obtain
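For reference, the probabilistic form of Hölder’s inequality used at several points above is, in generic notation: for random variables $X$ and $Y$ and exponents $p,q>1$ with $1/p+1/q=1$,
\[
E\bigl|XY\bigr| \,\le\, \bigl(E|X|^{p}\bigr)^{1/p}\,\bigl(E|Y|^{q}\bigr)^{1/q}.
\]
The case $p=q=2$ is the Cauchy–Schwarz inequality, which is also invoked above.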